## main packages
import requests, pandas as pd, numpy as np, geopandas as gpd, json
## packages for visualisations
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import altair as alt
## packages for modelling
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor # used in 4.2
from scipy.stats import pearsonr # used in 4.2 for correlations
Run on an M1 MacBook in a Conda environment with Python 3.9.15 (had issues running >3.10)
This project explores London crime behaviour using the freely available Crime API published by the UK police.
Specifically, the crime data is downloaded from the https://data.police.uk/api/crimes-street/ endpoint, which allows querying by crime category, location, and month.
'all-crime' can be passed as a category to retrieve data for all categories. Locations can be specified as a single coordinate pair, which returns crime incidents within a 1-mile radius, or as a set of coordinate pairs defining a polygon within which crimes are returned. The API holds data for the last 3 years, with results aggregated by month. As of 03/12/2022, the period available is October 2019 -> September 2022.
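As a sketch, a single point query against this endpoint might look like the following; the coordinates and month are hypothetical, and the live call is left commented out:

```python
import requests

# Hypothetical central-London coordinates; a lat/lng query returns incidents
# within a 1-mile radius for the given month.
params = {'lat': 51.5074, 'lng': -0.1278, 'date': '2022-09'}
url = 'https://data.police.uk/api/crimes-street/all-crime'

# Build the request without sending it, to show the composed query URL.
req = requests.Request('GET', url, params=params).prepare()
print(req.url)

# r = requests.get(url, params=params)  # uncomment to hit the live API
# crimes = r.json()                     # list of dicts, one per incident
```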
There are some limitations to consider:
To allow for easier categorisation and to stay within the results limit, the API is called for multiple areas and the responses are combined into a main Greater London dataset. Results by borough are small enough to stay within the limit, but the API could equally be called for smaller areas such as wards, MSOAs or LSOAs.
Specifying these areas requires a set of coordinates making up each area's perimeter, so the data gathering follows the process:
1. Find coordinates for London area
2. Format into lat/lng pairs per the API spec and retrieve crime data
3. Repeat for multiple months to create time-series
4. Repeat steps 1-3 for each area
Gathering location data can be near-automated by selecting a commonly used geographical hierarchy. For example: Local Authority Districts (ie London Boroughs), Electoral Wards, Middle layer Super Output Areas (MSOA, around 7,500 inhabitants), Lower layer Super Output Areas (LSOA, around 1,500 inhabitants) or Output Areas (OA, around 300 inhabitants).
Shapefiles for mapping are readily available from a variety of sources and can be used to extract the coordinates for the desired areas. These are generally available at a national level, but pre-filtered files can often be found for important areas.
ArcGIS hosts a huge database of map files, most of which can be manually downloaded in multiple file formats or accessed through their api.
Using a GET request, the 'borough boundaries' dataset can be downloaded in GeoJSON format - a version of JSON for storing geographic data structures. Then, using the geopandas Python module, this is loaded into a GeoDataFrame - a pandas dataframe with a geometry column. Each row represents a geometric object, for instance a London borough, so any accompanying information in the mapping file can easily be explored.
boroughs = requests.get('https://services.arcgis.com/drifeOPKLpgnJ8Qa/arcgis/rest/services/borough boundaries/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson')
df_boroughs = gpd.GeoDataFrame.from_features(boroughs.json())
## borough boundaries also available in external_data folder in case the dataset/API becomes unavailable through that link.
# df_boroughs = gpd.read_file('external_data/Borough_Boundaries.geojson')
df_boroughs.tail()
|   | geometry | FID | ogc_fid | name | gss_code | hectares | nonld_area | ons_inner | sub_2011 | Shape__Area | Shape__Length |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 28 | POLYGON ((-0.09764 51.57365, -0.09753 51.57368... | 29 | 29 | Hackney | E09000012 | 1904.902 | 0.000 | T | East | 4.921020e+07 | 40792.587675 |
| 29 | POLYGON ((-0.14239 51.56912, -0.14239 51.56928... | 30 | 30 | Haringey | E09000014 | 2959.837 | 0.000 | T | North | 7.659002e+07 | 46679.448320 |
| 30 | POLYGON ((0.02907 51.49609, 0.02780 51.49602, ... | 31 | 31 | Newham | E09000025 | 3857.806 | 237.637 | T | East | 9.954356e+07 | 51121.484382 |
| 31 | POLYGON ((0.09973 51.51190, 0.09976 51.51358, ... | 32 | 32 | Barking and Dagenham | E09000002 | 3779.934 | 169.150 | F | East | 9.760414e+07 | 59416.221750 |
| 32 | POLYGON ((-0.11158 51.51534, -0.11184 51.51580... | 33 | 33 | City of London | E09000001 | 314.942 | 24.546 | T | Central | 8.122667e+06 | 15417.507853 |
Each geometry object contains a polygon of coordinates, along with the relevant names and codes. So, to obtain coordinates, the polygon corresponding to each area must be extracted and cleaned into the correct format.
Some columns can also be extracted to add as features to the main data later.
## extract feature columns from the GeoDataFrame, then build a new plain DataFrame
b_names = df_boroughs['name']
b_area = df_boroughs['hectares']
b_inner = df_boroughs['ons_inner']
b_compass = df_boroughs['sub_2011']
borough_feat = pd.DataFrame()
borough_feat['Borough'] = b_names
borough_feat['Hectares'] = b_area
borough_feat['Inner'] = b_inner
borough_feat['Area'] = b_compass
## replace t and f with True and False
borough_feat['Inner'] = (borough_feat['Inner'] == 'T' )
borough_feat.tail()
|   | Borough | Hectares | Inner | Area |
|---|---|---|---|---|
| 28 | Hackney | 1904.902 | True | East |
| 29 | Haringey | 2959.837 | True | North |
| 30 | Newham | 3857.806 | True | East |
| 31 | Barking and Dagenham | 3779.934 | False | East |
| 32 | City of London | 314.942 | True | Central |
## create a list of all the area names to iterate through
areas = df_boroughs['name'].tolist()
For each area in list:
## define function that extracts polygon coordinates from dataframe
def get_coords(df):
    ## extract coordinates: 1. convert into dictionary, 2. extract coordinates into numpy array
    geo_dict = json.loads(df.to_json())  # avoid shadowing the builtin 'dict'
    coords = np.array(geo_dict['features'][0]['geometry']['coordinates'])
    ## round coordinates to 6dp - roughly 0.11m precision (improves api performance)
    coords = np.round(coords, decimals=6)
    ## build polygon string by swapping GeoJSON (lng,lat) pairs into the api's lat,lng order (lat,lng:lat,lng:...)
    # first coordinate pair is not preceded by a colon
    poly = 'poly=' + str(coords[0][0][1]) + ',' + str(coords[0][0][0])  # convert to string for concatenation
    for n in range(1, coords.shape[1]):  ## iterate from the 2nd coordinate pair through to the last
        poly = poly + ':' + str(coords[0][n][1]) + ',' + str(coords[0][n][0])
    return poly
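The poly string format the function builds can be illustrated with plain Python; the triangle of coordinates below is hypothetical, and the swap reflects GeoJSON storing pairs as (lng, lat):

```python
# Hypothetical polygon vertices as GeoJSON-style (lng, lat) tuples.
coords = [(-0.1278, 51.5074), (-0.1300, 51.5100), (-0.1250, 51.5120)]

# The API expects lat,lng pairs joined by colons, so swap each pair.
poly = 'poly=' + ':'.join(f'{lat},{lng}' for lng, lat in coords)
print(poly)  # poly=51.5074,-0.1278:51.51,-0.13:51.512,-0.125
```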
# create empty dataframe for merging all data
df_iter = pd.DataFrame()
## main function for retrieving data
def crime_api(df_iter, date, areas):
    for a in areas:
        ## filter by area name into new dataframe
        df_b = df_boroughs[df_boroughs['name'] == a]
        ## len(poly) exceeds the GET character limit, so call the api with a post request
        base = 'https://data.police.uk/api/crimes-street/all-crime'  # endpoint for all crimes data
        poly = get_coords(df_b)  # prep coordinates into the api's poly format
        ## call api and raise any status errors
        r = requests.post(base, data={'poly': poly, 'date': date})
        r.raise_for_status()
        ## read json array into dataframe
        df = pd.json_normalize(r.json())
        df['Borough'] = a  # add new column with borough identifier
        ## vertically join new data with any previous data
        df_iter = pd.concat([df_iter, df], ignore_index=True)
    return df_iter
### **** SET PARAMETERS ****
# set start and end dates in format: YYYY-MM, must be string.
min_month = '2022-03' ## started at 2019-10
max_month = '2022-09'
# period_range function returns a fixed step index between dates specified as monthly frequency
date_range = pd.period_range(min_month, max_month, freq='M')
# convert period_index into a list of dates that can be iterated on
date_list = list(date_range.astype(str))
## create empty dataframe for merging all data
# df_main = pd.DataFrame() # ** uncomment if first run
# iterate through each date in period index, extracting into string data format
for date in date_list:
    # assign updated dataframe on each iteration
    df_main = pd.concat([df_main, crime_api(df_iter, date, areas)], ignore_index=True)
When running the full download, a 500 server error was sometimes returned. It appeared random and unrelated to any API limits, so it is likely an issue on the API's end.
The frequency of this was reduced greatly by rounding coordinates to 6dp.
Workaround: since df_main is only updated once monthly data for all boroughs has been retrieved, if this error occurs before a full run completes, the code can be restarted with min_month set to the month after the most recent data in df_main.
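An alternative to restarting manually is a small retry wrapper around the POST. This is only a sketch: the attempt count and linear back-off are illustrative choices, not part of the original code, and the `post` parameter is injectable purely to keep the sketch testable.

```python
import time
import requests

def post_with_retry(url, data, attempts=3, wait=5, post=requests.post):
    """Retry a POST a few times on transient 5xx errors before giving up.
    attempts/wait are illustrative defaults, not tuned against the API."""
    for i in range(attempts):
        r = post(url, data=data)
        if r.status_code < 500:       # success or client error: no point retrying
            r.raise_for_status()
            return r
        time.sleep(wait * (i + 1))    # simple linear back-off before the next try
    r.raise_for_status()              # surface the final 5xx if all attempts failed
```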
## ** TEST CODE **
# proof of concept: 2 areas, 1 time period
areas = ['Bromley', 'Ealing']
date = '2020-05'
df_test = pd.DataFrame()
df_test = pd.concat([df_test, crime_api(df_iter, date, areas)], ignore_index=True)
df_test.sample(5)
|   | category | location_type | context | outcome_status | persistent_id | id | location_subtype | month | location.latitude | location.street.id | location.street.name | location.longitude | outcome_status.category | outcome_status.date | Borough |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1784 | anti-social-behaviour | Force | NaN | NaN |  | 83912932 |  | 2020-05 | 51.380602 | 928475 | On or near Eresby Drive | -0.026893 | NaN | NaN | Bromley |
| 5606 | bicycle-theft | Force | NaN | NaN | e3ede68b0a7682a0f664a085494d175b58eed1f452cc92... | 83982065 |  | 2020-05 | 51.522892 | 959657 | On or near Montpelier Road | -0.302663 | Investigation complete; no suspect identified | 2020-05 | Ealing |
| 2221 | drugs | Force | NaN | NaN | de8f36a7278f2d1ed21586c15802acb3ac17ce3a1b6d33... | 83982961 |  | 2020-05 | 51.392407 | 931158 | On or near Petrol Station | 0.002407 | Court result unavailable | 2020-11 | Bromley |
| 1593 | anti-social-behaviour | Force | NaN | NaN |  | 83880433 |  | 2020-05 | 51.401006 | 931643 | On or near Saxville Road | 0.108425 | NaN | NaN | Bromley |
| 4043 | anti-social-behaviour | Force | NaN | NaN |  | 83891351 |  | 2020-05 | 51.512141 | 958375 | On or near Uxbridge Road The Broadway | -0.383118 | NaN | NaN | Ealing |
Some crimes cannot be mapped to a location, for instance if the victim cannot recall where the crime happened. Crimes missing this location data can be retrieved by police force, again at a monthly frequency.
This could hinder any targeted geographical analysis, so it is worth comparing against the located crime data to ensure it represents a relatively small share of the overall total.
https://data.police.uk/api/crimes-no-location?category=all-crime&force=metropolitan
# set start and end dates in format: YYYY-MM, must be string.
min_month = '2019-10' ## starts from 2019-10 as of 03/12/22. if error code 404 is returned it is likely because this month is unavailable now (shifting 3 year window, so use later month)
max_month = '2022-09'
months_pr = pd.period_range(min_month, max_month, freq='M') # fill in monthly dates
months = list(months_pr.astype(str)) # convert period_range into a list of dates that can be iterated on
forces = ['metropolitan', 'city-of-london'] # list of London police forces
missing = pd.DataFrame() # empty df for merging data
# loop over dates
for month in months:
    for force in forces:
        base = 'https://data.police.uk/api/crimes-no-location?category=all-crime&force=%s' % force
        ## call api and raise any status errors
        r = requests.post(base, data={'date': month})
        r.raise_for_status()
        df_temp = pd.json_normalize(r.json())  # read array into dataframe
        missing = pd.concat([missing, df_temp], ignore_index=True)  # vertically join new data with any previous data
# drop unnecessary columns
missing.drop(['location_type', 'location', 'context', 'outcome_status', 'persistent_id', 'id', 'location_subtype', 'outcome_status.category', 'outcome_status.date'], axis=1, inplace=True)
This calls the API for the desired date range and for both of London's police forces. Since we are only interested in the number of crimes of each type, all the unnecessary columns are dropped, leaving only the 'month' and 'category' values. These can then be aggregated later by month / type.
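That later aggregation can be sketched with a toy frame mirroring the two retained columns; the rows below are illustrative, not real counts:

```python
import pandas as pd

# Toy version of the no-location frame after the column drop above.
missing_toy = pd.DataFrame({
    'month':    ['2019-10', '2019-10', '2019-10', '2019-11'],
    'category': ['burglary', 'burglary', 'drugs', 'burglary'],
})

# Roll up to monthly counts per category (one row per month/category pair).
monthly = missing_toy.groupby(['month', 'category']).size().reset_index(name='Crimes')
print(monthly)
```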
## call api endpoint returning crime categories (can use to format names later)
url = 'https://data.police.uk/api/crime-categories?date=2022-09'
cat_data = requests.get(url).json()
df_cat = pd.json_normalize(cat_data)
The main CSV file is roughly 600 MB, far above the 100 MB limit for GitHub file uploads.
-> Even with xz/bz2 compression, files were roughly 110-120 MB. But the dataset contains wasted space (ie empty/unnecessary columns) that would be deleted later in cleaning anyway, so these columns are dropped prior to exporting.
Files are saved in the data folder - later analysis uses the date ranges available on the API as of 03/12/22, so don't overwrite them as this may change results.
## drop unnecessary columns
# df_main.drop(['location_type', 'context', 'outcome_status', 'persistent_id', 'location_subtype', 'location.street.id'], axis=1, inplace=True)
## export downloaded data as csv file
# df_main.to_csv('data/all_crimes.csv', index = False) # for standard csv write
# df_main.to_csv('data/all_crimes.bz2', index=False) # for compressed csv write
# missing.to_csv('data/crimes_missing.csv', index=False)
# df_cat.to_csv('data/crime_categories.csv', index = False)
Using xz and bz2 compression brought file sizes down to 31-33 MB; bz2 write time was roughly 3 times faster than xz, so bz2 is used.
Borough-specific data has been gathered externally:
Each file was downloaded from source in Excel format, filtered for London boroughs only, and saved as a CSV in the external_data folder. These are loaded when required.
crimes_raw_df = pd.read_csv('data/all_crimes.bz2') # roughly 7s, read main dataset, pandas automatically decompresses file
crimes_raw_df
|   | category | id | month | location.latitude | location.street.name | location.longitude | outcome_status.category | outcome_status.date | Borough |
|---|---|---|---|---|---|---|---|---|---|
| 0 | anti-social-behaviour | 78702917 | 2019-10 | 51.411857 | On or near Supermarket | -0.300998 | NaN | NaN | Kingston upon Thames |
| 1 | anti-social-behaviour | 78702919 | 2019-10 | 51.411857 | On or near Supermarket | -0.300998 | NaN | NaN | Kingston upon Thames |
| 2 | anti-social-behaviour | 78702920 | 2019-10 | 51.414177 | On or near Nightclub | -0.301027 | NaN | NaN | Kingston upon Thames |
| 3 | anti-social-behaviour | 78702921 | 2019-10 | 51.411260 | On or near Nipper Alley | -0.300761 | NaN | NaN | Kingston upon Thames |
| 4 | anti-social-behaviour | 78702922 | 2019-10 | 51.403324 | On or near Bloomfield Road | -0.299847 | NaN | NaN | Kingston upon Thames |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3386726 | other-crime | 104960363 | 2022-09 | 51.516271 | On or near Aldermanbury | -0.092971 | Under investigation | 2022-09 | City of London |
| 3386727 | other-crime | 104960391 | 2022-09 | 51.513631 | On or near Finch Lane | -0.086019 | Under investigation | 2022-09 | City of London |
| 3386728 | other-crime | 104960233 | 2022-09 | 51.517770 | On or near | -0.078495 | Under investigation | 2022-09 | City of London |
| 3386729 | other-crime | 104960109 | 2022-09 | 51.517656 | On or near Sandy's Row | -0.077563 | Offender given a caution | 2022-09 | City of London |
| 3386730 | other-crime | 104960442 | 2022-09 | 51.510559 | On or near Queenhithe | -0.095054 | Under investigation | 2022-09 | City of London |
3386731 rows × 9 columns
The dataset contains 3,386,731 rows, each representing a crime reported in the London area in the 36 months to September 2022. Each row details the crime type, location type (ie whether jurisdiction lies with the regular police force or the transport police), a crime ID unique within the API, multiple location fields, and the crime outcome where available (historical data is updated regularly to match police and court outcomes).
crimes_raw_df.columns
Index(['category', 'id', 'month', 'location.latitude', 'location.street.name',
'location.longitude', 'outcome_status.category', 'outcome_status.date',
'Borough'],
dtype='object')
## read data for crime incidents not mapped to a location, each row represents an incident.
crimes_missing = pd.read_csv('data/crimes_missing.csv')
crimes_missing
|   | month | category |
|---|---|---|
| 0 | 2019-10 | anti-social-behaviour |
| 1 | 2019-10 | anti-social-behaviour |
| 2 | 2019-10 | anti-social-behaviour |
| 3 | 2019-10 | anti-social-behaviour |
| 4 | 2019-10 | anti-social-behaviour |
| ... | ... | ... |
| 44401 | 2022-09 | other-crime |
| 44402 | 2022-09 | other-crime |
| 44403 | 2022-09 | other-crime |
| 44404 | 2022-09 | other-crime |
| 44405 | 2022-09 | other-crime |
44406 rows × 2 columns
# rename columns
crimes_raw_df.columns = ['Crime Category', 'Crime ID', 'Month', 'Latitude', 'Street Name', 'Longitude', 'Outcome', 'Outcome Date', 'Borough']
# format crime categories from url name to nice name, first read csv of category names
cat_df = pd.read_csv('data/crime_categories.csv', index_col = 0)
cat_dict = cat_df.to_dict() # create dictionary from dataframe (avoid shadowing the builtin 'dict')
crimes_df = crimes_raw_df.replace({'Crime Category': cat_dict['name']}) # use the dictionary to replace all category names
crimes_missing = crimes_missing.replace({'category': cat_dict['name']}) # note: this frame keeps its original lowercase column name
# verify no duplicate crimes present, e.g. potentially from area boundaries when downloading from api
crimes_df['Crime ID'].duplicated().sum()
0
Add features extracted from GeoJSON
## merges dataframes using borough names as key
crimes_df = pd.merge(crimes_df, borough_feat, on='Borough')
Add population data, Borough Population CSV extracted from ONS 2021 Census Results first release
Can use later for calculating crime rates
pop_df = pd.read_csv('external_data/Borough_pop_census2021.csv', index_col=0)
crimes_df = pd.merge(crimes_df, pop_df, on='Borough')
# check column data types
crimes_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3386731 entries, 0 to 3386730
Data columns (total 14 columns):
 #   Column          Dtype
---  ------          -----
 0   Crime Category  object
 1   Crime ID        int64
 2   Month           object
 3   Latitude        float64
 4   Street Name     object
 5   Longitude       float64
 6   Outcome         object
 7   Outcome Date    object
 8   Borough         object
 9   Hectares        float64
 10  Inner           bool
 11  Area            object
 12  Date            datetime64[ns]
 13  Year            int64
dtypes: bool(1), datetime64[ns](1), float64(3), int64(2), object(7)
memory usage: 365.0+ MB
Columns with dates are formatted as objects, so create new column with date format and another column with only the year for later analysis.
crimes_df['Date'] = pd.to_datetime(crimes_df['Month'], format='%Y-%m') # create new column with datetime type
crimes_df['Year'] = crimes_df['Date'].dt.year # create new column holding only relevant year value
crimes_df.sample(5)
|   | Crime Category | Crime ID | Month | Latitude | Street Name | Longitude | Outcome | Outcome Date | Borough | Hectares | Inner | Area | population | Date | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 603107 | Anti-social behaviour | 86403184 | 2020-08 | 51.481915 | On or near Caroline Place | -0.430575 | NaN | NaN | Hillingdon | 11570.063 | False | West | 305900 | 2020-08-01 | 2020 |
| 2309311 | Violence and sexual offences | 80600573 | 2020-01 | 51.515761 | On or near Stourcliffe Street | -0.162390 | Investigation complete; no suspect identified | 2020-03 | Westminster | 2203.005 | True | Central | 204300 | 2020-01-01 | 2020 |
| 811447 | Burglary | 92790017 | 2021-05 | 51.561309 | On or near Sudbury Hill Close | -0.328500 | Investigation complete; no suspect identified | 2021-05 | Brent | 4323.270 | False | West | 339800 | 2021-05-01 | 2021 |
| 1037258 | Anti-social behaviour | 88453957 | 2020-11 | 51.479830 | On or near Old South Lambeth Road | -0.123439 | NaN | NaN | Lambeth | 2724.940 | True | Central | 317600 | 2020-11-01 | 2020 |
| 2524893 | Violence and sexual offences | 102071443 | 2020-03 | 51.517400 | Holborn (lu Station) | -0.120207 | Status update unavailable | 2020-07 | Camden | 2178.932 | True | Central | 210100 | 2020-03-01 | 2020 |
The data straddles the Covid period and thus exploring lockdown effects on crime may be interesting.
lockdown timeline available here - https://www.instituteforgovernment.org.uk/charts/uk-government-coronavirus-lockdowns
Considering full months:
These can then be used as filters when required.
Can also define three 12-month periods
L1 = pd.date_range(start='2020-04-01', end='2020-05-31' ,freq='MS')
L2 = pd.date_range(start='2020-11-01', end='2020-11-30' ,freq='MS')
L3 = pd.date_range(start='2021-01-01', end='2021-03-31' ,freq='MS')
Y1 = pd.date_range(start='2019-10-01', end='2020-09-30' ,freq='MS')
Y2 = pd.date_range(start='2020-10-01', end='2021-09-30' ,freq='MS')
Y3 = pd.date_range(start='2021-10-01', end='2022-09-30' ,freq='MS')
P1 = pd.period_range(start='2019-10-01', end='2020-09-30' ,freq='M')
P2 = pd.period_range(start='2020-10-01', end='2021-09-30' ,freq='M')
P3 = pd.period_range(start='2021-10-01', end='2022-09-30' ,freq='M')
## EG
crimes_df[crimes_df['Date'].isin(Y1)]
Anti-social behaviour isn't recorded in total crime stats, so create a new dataframe without it, keeping the full one for relevant questions.
## drop anti-social behaviour
incidents = crimes_df.copy()
crimes = crimes_df[(crimes_df['Crime Category'] != 'Anti-social behaviour')].reset_index(drop=True)
The main dataframes have been left unaggregated so can be combined / grouped when necessary.
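For instance, a borough-by-month rollup of the unaggregated rows is one call away. The frame below is a toy stand-in for `crimes` (the real one has one row per incident); `pd.crosstab` gives a wide month-per-borough view ready for plotting:

```python
import pandas as pd

# Toy rows mimicking the unaggregated 'crimes' frame (one row per incident).
toy = pd.DataFrame({
    'Borough': ['Camden', 'Camden', 'Brent'],
    'Month':   ['2021-01', '2021-01', '2021-02'],
})

# Cross-tabulate incident counts: boroughs as rows, months as columns.
wide = pd.crosstab(toy['Borough'], toy['Month'])
print(wide)
```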
crimes.sample(5)
|   | Crime Category | Crime ID | Month | Latitude | Street Name | Longitude | Outcome | Outcome Date | Borough | Hectares | Inner | Area | population | Date | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1430494 | Violence and sexual offences | 102523035 | 2022-06 | 51.425407 | On or near High Street | -0.219138 | Under investigation | 2022-06 | Merton | 3762.466 | False | South | 215200 | 2022-06-01 | 2022 |
| 1029118 | Violence and sexual offences | 99050329 | 2022-01 | 51.464911 | On or near Halsbrook Road | 0.036605 | Investigation complete; no suspect identified | 2022-03 | Greenwich | 5044.190 | False | East | 289100 | 2022-01-01 | 2022 |
| 1714438 | Violence and sexual offences | 94043766 | 2021-07 | 51.513664 | On or near Frith Street | -0.131575 | Investigation complete; no suspect identified | 2021-07 | Westminster | 2203.005 | True | Central | 204300 | 2021-07-01 | 2021 |
| 2267228 | Violence and sexual offences | 84607321 | 2020-06 | 51.525729 | On or near Marlow Road | 0.054642 | Investigation complete; no suspect identified | 2020-06 | Newham | 3857.806 | True | East | 351100 | 2020-06-01 | 2020 |
| 806669 | Vehicle crime | 81095047 | 2020-02 | 51.483669 | On or near John Ruskin Street | -0.094404 | Court result unavailable | 2020-08 | Southwark | 2991.340 | True | Central | 307700 | 2020-02-01 | 2020 |
First, we can look at how total incidents behave across the data range.
## create new dataframes with monthly crime totals, rename & add indicator columns, then vertically concat
exp1a = pd.DataFrame(incidents['Month'].value_counts()).reset_index()
exp1b = pd.DataFrame(crimes['Month'].value_counts()).reset_index()
exp1a.columns = ['Month', 'Crimes']
exp1b.columns = ['Month', 'Crimes']
exp1a['Measure'] = 'Crimes (incl. anti-social behaviour)'
exp1b['Measure'] = 'Crimes'
exp1 = pd.concat([exp1a, exp1b], ignore_index=True)
alt.Chart(exp1).mark_line(point=True).encode(
x = alt.X('Month:T', title=None, axis=alt.Axis(grid=False)),
y = alt.Y('Crimes:Q', title=None),
color = alt.Color('Measure:N', legend=alt.Legend(orient='bottom-right')),
tooltip = [alt.Tooltip('Date:T', title='Month', format='%b %Y'), alt.Tooltip('Crimes:Q', format=',')]
).properties(
width = 600,
title = 'London: Monthly Crime Totals'
)
Crime rates are roughly consistent over the period, with sharp drops around the lockdown periods (except for anti-social behaviour) but no clear trend.
How does this translate to a yearly total? Will consider our three 12-month periods.
# create function to apply to DF to add column value if in certain date range
def add_period(df):
    if df['Month'] in list(P1.astype(str)):
        return '2019-20'
    elif df['Month'] in list(P2.astype(str)):
        return '2020-21'
    elif df['Month'] in list(P3.astype(str)):
        return '2021-22'
exp1['Period'] = exp1.apply(add_period, axis = 1)
alt.Chart(exp1).mark_bar(size=40, opacity=0.7).encode(
x = alt.X('Period:O', title=None, axis=alt.Axis(labelAngle=-30, labelOffset=30)),
y = alt.Y('sum(Crimes):Q', stack=None, title=None, scale=alt.Scale(domain=[0, 1400000])),
color = alt.Color('Measure:N', title=None, legend=alt.Legend(orient='top-right')),
).properties(
width = 300,
title = 'London: Yearly Crime Totals'
)
The grey bar represents only crimes while the grey + yellow represent all crime incidents.
The year-to-2021 and year-to-2022 totals are similar; however, the decrease in anti-social behaviour incidents masks a slight rise in crime. Both periods contained lockdowns, so no trend conclusions can be drawn yet.
Identify which crimes are most and least common across the whole 36-month period.
## Each row in main represents an incident, so counting each unique instance of a Crime Category will give total instances per category.
exp3 = pd.DataFrame(incidents['Crime Category'].value_counts()).reset_index()
exp3.columns = ['Category', 'Total Incidents']
Most frequent crime types can be visualised with a bar chart
alt.Chart(exp3).mark_bar().encode(
x = alt.X('Total Incidents:Q'),
y = alt.Y('Category:N', sort = '-x', title=None),
tooltip = [alt.Tooltip('Total Incidents:Q', format=',')]
).properties(
title = 'London: Total Crime Incidents (Oct-2019 to Sep-2022)'
)
From the ONS: 'Violent crime covers a wide range of offences including minor assaults (such as pushing and shoving), harassment and abuse (that result in no physical harm) through to wounding and homicide. Sexual offences include rape, sexual assault and unlawful sexual activity against adults and children, sexual grooming and indecent exposure.'
ASB shows by far the highest total incidents, as expected since it is not typically included in crime stats. Violence and sexual offences includes many sub-categories, which likely also explains why it is very high.
The possession of weapons in London is often talked about as an epidemic; however, it makes up a tiny proportion of crimes, with 16,557 recorded incidents over the last 3 years.
# find the total number of monthly crime incidents for each crime category
exp4 = crimes.groupby(['Crime Category', 'Month'], as_index=False)['Month'].value_counts()
alt.Chart(exp4).mark_boxplot(size=50, ticks=True).encode(
x = alt.X('Crime Category:N', title=None, axis=alt.Axis(labels=False, ticks=False), scale=alt.Scale(padding=1)),
y = alt.Y('count:Q', title=None),
color = alt.Color('Crime Category:N', legend=None),
facet = alt.Facet('Crime Category:O', columns=7, title='London: Monthly crime dispersion in the last 3 years'),
).properties(
width=100,
height = 150
).resolve_scale(
y='independent',
x='independent'
)
This boxplot shows how the monthly crime totals vary for each crime category - in the 36 month period Oct-2019 to Sep-2022.
The whiskers are extended to the furthest points within 1.5 x the interquartile range, from the 1st and 3rd quartile. Any outliers to this are plotted as well.
Theft from the person and other theft show the biggest dispersion, likely due to the lockdowns. Similarly, bicycle theft shows a large dispersion weighted upwards, suggesting a few very high monthly values.
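The whisker rule described above can be checked directly in pandas. The series below is a toy set of monthly totals; the 1.5 multiplier is the standard Tukey fence used by the boxplot:

```python
import pandas as pd

s = pd.Series([10, 12, 14, 15, 18, 21, 40])     # toy monthly crime totals
q1, q3 = s.quantile(0.25), s.quantile(0.75)     # quartiles (13.0 and 19.5 here)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey fences bounding the whiskers
outliers = s[(s < lower) | (s > upper)]         # points plotted beyond the whiskers
```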
To visualise this, we can take the initial London borough map GeoJSON and combine it with stats aggregated at the borough level. This can then be displayed as a choropleth map to identify any high-crime clusters.
## counts unique instances of borough name, ie number of crimes reported by borough across the whole period.
borough_freq = pd.DataFrame(crimes['Borough'].value_counts()).reset_index()
borough_freq.columns = ['Borough', 'Total Incidents']
# uses the Altair package to plot a choropleth map of crime incidents by London borough.
# initial geoDataFrame from arcgis api is used as a base map, with a data lookup to get crime incidents from dataframe.
alt.Chart(df_boroughs).mark_geoshape().encode(
color='Total Incidents:Q',
tooltip= [alt.Tooltip('name:N', title='Borough'), alt.Tooltip('Total Incidents:Q', format=',')]
).transform_lookup(
lookup='name',
from_=alt.LookupData(borough_freq, 'Borough', ['Total Incidents'])
).project(
type='mercator'
).properties(
width=500,
height=300,
title='London: Mapping 3 years of crime'
)
This generally shows higher crime totals in inner London, with City of London an outlier due to its smaller area and population. To better compare across boroughs it would be beneficial to combine with population data and calculate crime rates.
Borough Population CSV extracted from ONS 2021 Census Results first release.
Crime rates are typically quoted as yearly rates, so extract only the 2021 crime data.
## create boolean variable for when year is 2021, then apply this to main df as a filter, leaving only crimes reported in 2021.
# is_2021 = crimes_df['Year'] == 2021
crimes_2021 = crimes[crimes['Year'] == 2021]
crimes_2021.sample(3)
|   | Crime Category | Crime ID | Month | Latitude | Street Name | Longitude | Outcome | Outcome Date | Borough | Hectares | Inner | Area | population | Date | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1411524 | Vehicle crime | 89907023 | 2021-01 | 51.415044 | On or near Garden Avenue | -0.150958 | Status update unavailable | 2021-05 | Merton | 3762.466 | False | South | 215200 | 2021-01-01 | 2021 |
| 1004630 | Other theft | 90704636 | 2021-02 | 51.488995 | On or near Parking Area | 0.069426 | Investigation complete; no suspect identified | 2022-01 | Greenwich | 5044.190 | False | East | 289100 | 2021-02-01 | 2021 |
| 838831 | Shoplifting | 91448735 | 2021-03 | 51.471558 | On or near Cerise Road | -0.068112 | Investigation complete; no suspect identified | 2021-03 | Southwark | 2991.340 | True | Central | 307700 | 2021-03-01 | 2021 |
## apply to find incidents for each borough in 2021
borough_freq_2021 = pd.DataFrame(crimes_2021['Borough'].value_counts()).reset_index()
borough_freq_2021.columns = ['Borough', 'Total Incidents']
borough_freq_2021.head()
|   | Borough | Total Incidents |
|---|---|---|
| 0 | Westminster | 49294 |
| 1 | Newham | 32807 |
| 2 | Croydon | 31917 |
| 3 | Tower Hamlets | 31569 |
| 4 | Lambeth | 31526 |
Now import population data, merge the files using the shared borough names, and calculate crime rates as per 1,000 population.
City of London's resident population is only 8,600, but its working/daytime population is as high as 500,000, which would drastically inflate its crime rate, so it is dropped from the data.
# Borough Population CSV extracted from ONS 2021 Census Results first release
pop_df = pd.read_csv('external_data/Borough_pop_census2021.csv', index_col=0)
crime_rates = pd.merge(borough_freq_2021, pop_df, on=['Borough']) # merge population onto borough totals
crime_rates['Crime Rate'] = (1000*crime_rates['Total Incidents']) / crime_rates['population']
crime_rates = crime_rates.iloc[:-1,:].copy() ## Drop last row, ie City of London
# as before, but now using crime rate for 2021 rather than total incidents.
alt.Chart(df_boroughs).mark_geoshape().encode(
color='Crime Rate:Q',
tooltip= [alt.Tooltip('name:N', title='Borough'), alt.Tooltip('Crime Rate:Q', format='d', title='Crime Rate per 1,000 people')]
).transform_lookup(
lookup='name',
from_=alt.LookupData(crime_rates, 'Borough', ['Crime Rate'])
).project(
type='mercator'
).properties(
width=500,
height=300,
title='London: 2021 Crime rates by borough (per 1,000 people)'
)
Crime rates are now clearly higher for inner London, with Westminster by far the highest. Its high tourist footfall is generally blamed for the borough's crime rate. This visualisation also uses the newest 2021 Census borough population, which for Westminster has actually decreased since 2011, inflating its crime rate further.
Is this reflected in the type of crimes committed?
## filter the main df to Westminster crimes reported in 2021, then count incidents per category
Westmin_df = pd.DataFrame(crimes[(crimes['Borough'] == 'Westminster') & (crimes['Year'] == 2021)]['Crime Category'].value_counts()).reset_index()
Westmin_df.columns = ['Category', 'Total Incidents']
Find crimes per type for the whole of London in 2021
filt = crimes[crimes['Year'] == 2021]
Lon2021 = pd.DataFrame(filt['Crime Category'].value_counts()).reset_index()
Lon2021.columns = ['Category', 'Total Incidents']
Plot crime incidents by type in Westminster in 2021
Westmin = alt.Chart(Westmin_df).mark_bar().encode(
x = alt.X('Total Incidents:Q'),
y = alt.Y('Category:N', sort = '-x', title=None),
tooltip = [alt.Tooltip('Total Incidents:Q', format=',')]
).properties(
title = 'Westminster: Crime in 2021'
)
London_av = alt.Chart(Lon2021).mark_bar().encode(
x = alt.X('Total Incidents:Q'),
y = alt.Y('Category:N', sort = '-x', title=None),
tooltip = [alt.Tooltip('Total Incidents:Q', format=',')]
).properties(
title = 'London: Crime in 2021'
)
Westmin | London_av
Lastly, we can look at crime outcomes.
crimes['Outcome'].value_counts(normalize=True).round(3)
Investigation complete; no suspect identified          0.630
Status update unavailable                              0.201
Under investigation                                    0.076
Court result unavailable                               0.039
Local resolution                                       0.031
Offender given penalty notice                          0.007
Offender given a caution                               0.007
Awaiting court outcome                                 0.007
Unable to prosecute suspect                            0.001
Offender given a drugs possession warning              0.000
Formal action is not in the public interest            0.000
Action to be taken by another organisation             0.000
Further investigation is not in the public interest    0.000
Suspect charged as part of another case                0.000
Further action is not in the public interest           0.000
Name: Outcome, dtype: float64
By far, most crimes result in no suspect being identified.
## create new dataframes with monthly crime totals, rename & add indicator columns, then vertically concat
q1_1a = pd.DataFrame(incidents['Month'].value_counts()).reset_index()
q1_1b = pd.DataFrame(crimes['Month'].value_counts()).reset_index()
q1_1a.columns = ['Month', 'Crimes']
q1_1b.columns = ['Month', 'Crimes']
q1_1a['Measure'] = 'Crimes (incl. anti-social behaviour)'
q1_1b['Measure'] = 'Crimes'
q1_1 = pd.concat([q1_1a, q1_1b], ignore_index=True)
alt.Chart(q1_1).mark_line(point=True).encode(
x = alt.X('Month:T', title=None, axis=alt.Axis(grid=False)),
y = alt.Y('Crimes:Q', title=None),
color = alt.Color('Measure:N', legend=alt.Legend(orient='bottom-right')),
tooltip = [alt.Tooltip('Month:T', title='Month', format='%b %Y'), alt.Tooltip('Crimes:Q', format=',')]
).properties(
width = 600,
title = 'London: Monthly Crime Totals'
)
No obvious long-term trend: crime and anti-social behaviour incidents track each other except for March to June 2020.
Since mid-2021, crime totals have been mostly stable around 70,000, with all incidents stable between 80,000 and 100,000. Total incidents spiked in April and May of 2020, the first two full months of nationally imposed lockdowns.
Incidents of anti-social behaviour clearly spiked in the first few months of nationally imposed lockdown, likely because covid breaches were typically reported as ASB.
To see how crime incidents have changed over the data period, we can use a similar multi-line chart as before, with a couple of key changes:
i) iterate through each crime type and calculate a three-month rolling mean for crimes of that type, such that the calculated 2020-01 value is the mean of the 2019-11, 2019-12, and 2020-01 total monthly crime values
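As a quick sanity check of the rolling-mean semantics, a toy example with made-up monthly counts (illustrative values only, not from the crime data):

```python
import pandas as pd

# three-month rolling mean: each value averages itself and the two prior months
counts = pd.Series([1200, 1500, 1400, 1000],
                   index=['2019-11', '2019-12', '2020-01', '2020-02'])
rolled = counts.rolling(3).mean()
# the first two entries are NaN (not enough history), which is why the loop
# below drops the first two rows of each subset
print(rolled['2020-01'])  # mean of 1200, 1500, 1400 -> approx 1366.67
```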
## calculate monthly crime total per crime type
cat_monthly = crimes.groupby(['Crime Category', 'Month'], as_index=False)['Month'].value_counts()
## create list of unique crime types to iterate over
types = list(cat_monthly['Crime Category'].unique())
df_roll = pd.DataFrame()
for t in types:
    subset = cat_monthly[(cat_monthly['Crime Category'] == t)].reset_index(drop=True)
    subset['Crimes (3-month average)'] = subset['count'].rolling(3).mean()
    subset = subset.iloc[2:,:].copy() # drop first two rows (NaN rolling values)
    df_roll = pd.concat([df_roll, subset], ignore_index=True)
df_roll
| | Crime Category | Month | count | Crimes (3-month average) |
|---|---|---|---|---|
| 0 | Bicycle theft | 2019-12 | 1030 | 1363.666667 |
| 1 | Bicycle theft | 2020-01 | 1251 | 1205.333333 |
| 2 | Bicycle theft | 2020-02 | 1123 | 1134.666667 |
| 3 | Bicycle theft | 2020-03 | 1128 | 1167.333333 |
| 4 | Bicycle theft | 2020-04 | 1080 | 1110.333333 |
| ... | ... | ... | ... | ... |
| 437 | Violence and sexual offences | 2022-05 | 23025 | 21774.000000 |
| 438 | Violence and sexual offences | 2022-06 | 21682 | 21671.666667 |
| 439 | Violence and sexual offences | 2022-07 | 22883 | 22530.000000 |
| 440 | Violence and sexual offences | 2022-08 | 21647 | 22070.666667 |
| 441 | Violence and sexual offences | 2022-09 | 20513 | 21681.000000 |
442 rows × 4 columns
ii) Index the values: again take a subset for each crime type, then divide every observation by the first value and scale so the first month equals 100
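A toy version of that indexing, using the first three rolled bicycle-theft values from the table above:

```python
import pandas as pd

# divide every observation by the first value and scale so t1 = 100
vals = pd.Series([1363.666667, 1205.333333, 1134.666667])
indexed = (vals / vals.iloc[0]) * 100
print(indexed.round(2).tolist())  # [100.0, 88.39, 83.21]
```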
## calculate index for crime count = 100 at t1
index = []
for t in types:
    subset = df_roll[(df_roll['Crime Category'] == t)].reset_index(drop=True)
    indexrow = subset[:1] ## select first row to be divisor
    subset['Rolling crime average'] = (subset['Crimes (3-month average)'] / indexrow['Crimes (3-month average)'][0]) * 100
    index.extend(list(subset['Rolling crime average']))
df_roll['Index'] = index # add index to original DF
df_roll.head()
| | Crime Category | Month | count | Crimes (3-month average) | Index |
|---|---|---|---|---|---|
| 0 | Bicycle theft | 2019-12 | 1030 | 1363.666667 | 100.000000 |
| 1 | Bicycle theft | 2020-01 | 1251 | 1205.333333 | 88.389147 |
| 2 | Bicycle theft | 2020-02 | 1123 | 1134.666667 | 83.207040 |
| 3 | Bicycle theft | 2020-03 | 1128 | 1167.333333 | 85.602542 |
| 4 | Bicycle theft | 2020-04 | 1080 | 1110.333333 | 81.422635 |
Plot all index results on a multi-line chart
alt.Chart(df_roll).mark_line(point='transparent').encode(
x = alt.X('Month:O', title=None),
y = alt.Y('Index:Q', title=None),
color = alt.Color('Crime Category:N')
).properties(
title = 'London: Monthly crime incidents (3 month rolling average)',
width = 600,
height = 300
)
Interestingly, the crimes that diverged during the first lockdown period have remained mostly separated - i.e., they mostly stay on either side of y=95.
This also shows some seasonality, most clearly evident with bicycle theft but also noticeable with public order - it may also be true of other categories, although lockdowns likely caused similar behaviour.
Bike theft in the summer of 2020 was around 50% higher than in subsequent years, likely a result of cycling becoming especially popular during and after the first lockdown, with new bikes scarce and prices far higher than normal.
Since we have three years of data but not three full calendar years, we can use three 12-month periods instead.
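The add_period helper is defined in part 3 and not shown here; a minimal sketch of the behaviour it would need (an assumed implementation, mapping 'YYYY-MM' month strings onto the three Oct-Sep periods):

```python
# Assumed sketch of the add_period helper from part 3: lexicographic comparison
# works because months are zero-padded 'YYYY-MM' strings
def add_period(row):
    month = row['Month']
    if month <= '2020-09':
        return '2019-20'
    elif month <= '2021-09':
        return '2020-21'
    else:
        return '2021-22'

print(add_period({'Month': '2021-06'}))  # '2020-21'
```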
## use function from part 3 to add new column with relevant yearly period (P1, P2, P3)
cat_monthly['Year'] = cat_monthly.apply(add_period, axis=1)
cat_monthly.sample(3)
| | Crime Category | Month | count | Year |
|---|---|---|---|---|
| 56 | Burglary | 2021-06 | 4232 | 2020-21 |
| 327 | Shoplifting | 2020-01 | 4008 | 2019-20 |
| 266 | Public order | 2020-12 | 4106 | 2020-21 |
Now we can find the average monthly number of crime incidents across each 12-month period.
cat_yearly = cat_monthly.groupby(['Crime Category', 'Year'], as_index=False)['count'].mean().round(0)
cat_yearly.sample(3)
| | Crime Category | Year | count |
|---|---|---|---|
| 2 | Bicycle theft | 2021-22 | 1630.0 |
| 29 | Shoplifting | 2021-22 | 3111.0 |
| 25 | Robbery | 2020-21 | 1906.0 |
All the data is now in one dataframe, so it can be displayed. For this we can use a faceted line plot with points for each of the yearly periods.
alt.Chart(cat_yearly).mark_line(point=True).encode(
x = alt.X('Year:O', title=None, axis=alt.Axis(labelAngle=-40, labelOffset=10)),
y = alt.Y('count:Q', title=None),
color = alt.Color('Crime Category:N'),
tooltip = [alt.Tooltip('Crime Category:N'), alt.Tooltip('Year:N'), alt.Tooltip('count:Q', title='Average Monthly Incidents', format=',d')],
facet = alt.Facet('Crime Category:O', columns=7, title='London: Yearly crime incidents (averaged monthly)'),
).properties(
width = 100,
height = 100
).resolve_scale(
y='independent'
)
So, when considering each category across the yearly periods (Oct through Sep), there have been some interesting changes:
Burglary and robbery have both trended down slightly, but theft from the person and other theft have increased significantly - potentially lockdown effects.
So we have shown that the relatively stable overall crime rate actually masks a lot of change between crime categories. Now we can consider whether it also masks even greater changes among boroughs.
For this, we will repeat similar steps as before to find incidents across each yearly period (using yearly totals now rather than averages), but now also grouping by borough.
bor_monthly = crimes.groupby(['Borough', 'Crime Category', 'Month'], as_index=False)['Month'].value_counts()
## use function from part 3 to add new column with relevant yearly period (P1, P2, P3)
bor_monthly['Year'] = bor_monthly.apply(add_period, axis=1)
## find total yearly crime incidents for each category in each borough
bor_yearly = bor_monthly.groupby(['Borough', 'Crime Category', 'Year'], as_index=False)['count'].sum()
Now, for each borough, we can find the change between P1 and P3 in each crime type.
bor_wide = pd.pivot(bor_yearly, index=['Borough', 'Crime Category'], columns=['Year'], values='count')
bor_wide.reset_index(inplace=True) # reset to remove multi-index
bor_wide = bor_wide.rename_axis(None, axis=1) # remove name from column index (rename_axis returns a copy)
# calculate percentage change in two year period
bor_wide['Two-year Change'] = (bor_wide['2021-22'] - bor_wide['2019-20']) / bor_wide['2019-20']
# drop City of London data as very sensitive to change
bor_wide = bor_wide[(bor_wide['Borough'] != 'City of London')]
Now, for each category, find the greatest positive and negative changes.
df_highlow = pd.DataFrame()
for t in types:
    subset = bor_wide[(bor_wide['Crime Category'] == t)].reset_index(drop=True)
    subset = subset.sort_values(by=['Two-year Change'], ascending=False, ignore_index=True)
    subset = pd.concat([subset.head(3), subset.tail(3)], ignore_index=True) ## take 3 highest and lowest
    df_highlow = pd.concat([df_highlow, subset], ignore_index=True)
Plot all the results on separate bar charts, with a diverging colour scale used to show the differences.
alt.Chart(df_highlow).mark_bar().encode(
y = alt.Y('Borough:N', sort='-x', title = None),
x = alt.X('Two-year Change:Q', axis=alt.Axis(format='%'), title=None),
color = alt.Color('Two-year Change', scale=alt.Scale(scheme='blueOrange'), legend=None),
tooltip = [alt.Tooltip('Borough:N'), alt.Tooltip('Two-year Change:Q', format='.1%'), alt.Tooltip('2021-22:Q', title='Total Incidents')],
facet = alt.Facet('Crime Category:O', columns=5, title='London: Which boroughs have seen the biggest changes in crime in the last two years?'),
).properties(
width = 120,
height = 100
).resolve_scale(
y='independent',
x='independent',
color='independent'
)
The chart above plots the 3 biggest positive and negative changes in total crime incidents across the two-year period, i.e. from 2019-20 to 2021-22.
This shows again that observing relatively small changes in crime aggregated across the whole of London can mask much larger changes at a smaller level.
Westminster topped the charts in three areas: other theft, public order, and violence and sexual offences. Potentially its high crime incident counts are tied to the high daytime and tourist population, which dwindled in the first period due to covid lockdowns.
Burglary has fallen across every borough, suggesting a structural impact of covid - for example, more people working from home - may have had an effect.
Conversely, violence and sexual offences have increased in every borough.
...and if so, can any relationships be drawn?
The ONS publishes the English Indices of Deprivation every 3-5 years (latest available: 2019), which collates multiple measures of inequality at the LAD and LSOA level. London boroughs make up London's Local Authority Districts, so we can use this to merge the datasets. As with population, this data is not especially fast moving, so we shouldn't expect any significant effects from the measurement periods not exactly aligning.
We can first filter the crime dataset to a yearly period - we will use the full dataset including anti-social behaviour. August 2021 to July 2022 was chosen because the last UK covid restrictions were lifted in July 2021, so choosing this period, rather than 2021 or any earlier, should remove or reduce possible covid effects.
# df1 = crimes[(crimes['Month'] >= '2021-08') & (crimes['Month'] <= '2022-07')]
df1 = incidents[(incidents['Month'] >= '2021-08') & (incidents['Month'] <= '2022-07')]
df2 = df1.groupby(['Borough'], as_index=False)['Crime Category'].value_counts()
## load population data, merge, calculate crime rate
pop_df = pd.read_csv('external_data/Borough_pop_census2021.csv', index_col = 0)
df3 = pd.merge(df2, pop_df, on=['Borough'])
df3['Crime Rate'] = ((1000*df3['count']) / df3['population']).round(2)
df3.drop(['count'], axis=1, inplace=True) ## drop count column as only need crime rate
## convert from long-form to wide-form dataframe format
df4 = pd.pivot(df3, index=['Borough', 'population'], columns=['Crime Category'], values='Crime Rate')
df4.reset_index(inplace=True)
## load deprivation data, clean, and combine with crime df
dep_df = pd.read_csv('external_data/Borough_deprivation.csv')
# convert to percentages
dep_df['Income deprivation rate (%)'] = dep_df['Income deprivation rate (%)'] * 100
dep_df['Deprivation gap (%)'] = dep_df['Deprivation gap (%)']*100
df5 = pd.merge(df4, dep_df, on='Borough')
## drop data for City of London -> exceptionally low population (roughly 8000) wildly distorts crime rates
df5 = df5[df5['Borough'] != 'City of London']
dep_df.head()
| | LAD code 2019 | Borough | Profile | Deprivation gap (%) | Deprivation gap ranking | Moran's I | Moran's I ranking | Income deprivation rate (%) | Income deprivation rate ranking | Income deprivation rate quintile |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | E09000001 | City of London | Less income deprived | 20.0 | 255 | -0.15 | 316 | 7.0 | 280 | 5 |
| 1 | E09000002 | Barking and Dagenham | More income deprived | 25.0 | 195 | 0.27 | 175 | 19.0 | 20 | 1 |
| 2 | E09000003 | Barnet | n-shape | 32.0 | 132 | 0.36 | 105 | 11.0 | 148 | 3 |
| 3 | E09000004 | Bexley | Flat | 26.0 | 194 | 0.57 | 21 | 11.0 | 169 | 3 |
| 4 | E09000005 | Brent | More income deprived | 32.0 | 126 | 0.55 | 26 | 16.0 | 65 | 2 |
ONS definitions:
# available crime types
'Anti-social behaviour', 'Bicycle theft', 'Burglary', 'Criminal damage and arson', 'Drugs', 'Other theft', 'Possession of weapons', 'Public order', 'Robbery', 'Shoplifting', 'Theft from the person', 'Vehicle crime', 'Violence and sexual offences', 'Other crime'
drugs = alt.Chart(df5).mark_circle().encode(
x = alt.X('Income deprivation rate (%):Q', scale=alt.Scale(domain=[4, 22])),
y = alt.Y('Drugs:Q', title=None),
color = alt.Color('Borough:N', legend=None),
size = 'population',
tooltip = [alt.Tooltip('Borough:N'), alt.Tooltip('population:Q', title='Population', format=','), alt.Tooltip('Drugs:Q', title='Drug crime rate'), alt.Tooltip('Income deprivation rate (%):Q')]
).properties(
title = 'Drug offence crime rate (per 1,000 people)',
width = 350
)
shoplifting = alt.Chart(df5).mark_circle().encode(
x = alt.X('Income deprivation rate (%):Q', scale=alt.Scale(domain=[4, 22])),
y = alt.Y('Shoplifting:Q', title=None),
color = alt.Color('Borough:N', legend=None),
size = 'population',
tooltip = [alt.Tooltip('Borough:N'), alt.Tooltip('population:Q', title='Population', format=','), alt.Tooltip('Shoplifting:Q', title='Shoplifting crime rate'), alt.Tooltip('Income deprivation rate (%):Q')]
).properties(
title = 'Shoplifting crime rate (per 1,000 people)',
width = 350
)
violence = alt.Chart(df5).mark_circle().encode(
x = alt.X('Income deprivation rate (%):Q', scale=alt.Scale(domain=[4, 22])),
y = alt.Y('Violence and sexual offences:Q', title=None),
color = alt.Color('Borough:N', legend=None),
size = 'population',
tooltip = [alt.Tooltip('Borough:N'), alt.Tooltip('population:Q', title='Population', format=','), alt.Tooltip('Violence and sexual offences:Q', title='Violence and sexual offences'), alt.Tooltip('Income deprivation rate (%):Q')]
).properties(
title = 'Violence and sexual offences (per 1,000 people)',
width = 350
)
## could be shortened by using .repeat() but lose ability to set unique titles.
drugs | shoplifting | violence
This plots 3 crime rate types against the 2019 income deprivation rate for each London borough. The deprivation rate is defined by the ONS as measuring the 'proportion of the population experiencing deprivation relating to low income'.
As such, an upward trend without many outliers suggests that crime rates correlate with higher income deprivation.
Both drug offences and violence & sexual offences show a clear upward correlation, while (perhaps surprisingly) shoplifting shows a relatively flat relationship with income deprivation.
Again, Westminster is a clear outlier, with a near-average deprivation rate but a crime rate 2-4 times higher than any other borough - the high tourist and daytime population compared to the resident population is the most likely cause.
There is clearly some relationship (whether spurious or not) between income deprivation and crime rates, which motivates exploring it across all crime types.
## create list of unique crime categories to iterate over
types = list(df2['Crime Category'].unique())
corr_df = pd.DataFrame(columns=['Crime Type', 'Correlation', 'p-value'])
for crime in types:
    data1 = df5['Income deprivation rate (%)']
    data2 = df5[crime]
    # calculate Pearson's correlation coefficient and p-value
    corr, pvalue = pearsonr(data1, data2)
    row = [crime, corr, pvalue]
    corr_df.loc[len(corr_df)] = row
## conditional cell highlighting function
def pvalue_highlight(row):
    val = row.loc['p-value'] # take the value from the p-value column to test
    if val < 0.01:
        color = '#9080ff'
    elif val < 0.05:
        color = '#776bcd'
    elif val < 0.1:
        color = '#48446e'
    else:
        color = ''
    return ['background-color: {}'.format(color) for r in row]
corr_df.style.apply(pvalue_highlight, axis=1).format('{:.4f}', subset=['Correlation','p-value'])
| | Crime Type | Correlation | p-value |
|---|---|---|---|
| 0 | Violence and sexual offences | 0.5320 | 0.0017 |
| 1 | Anti-social behaviour | 0.4683 | 0.0069 |
| 2 | Vehicle crime | 0.3740 | 0.0350 |
| 3 | Other theft | 0.1506 | 0.4107 |
| 4 | Criminal damage and arson | 0.4514 | 0.0095 |
| 5 | Drugs | 0.4454 | 0.0106 |
| 6 | Public order | 0.3503 | 0.0494 |
| 7 | Burglary | 0.4312 | 0.0137 |
| 8 | Shoplifting | -0.0066 | 0.9712 |
| 9 | Robbery | 0.3861 | 0.0291 |
| 10 | Theft from the person | 0.1772 | 0.3318 |
| 11 | Other crime | 0.0517 | 0.7786 |
| 12 | Bicycle theft | 0.2228 | 0.2203 |
| 13 | Possession of weapons | 0.5177 | 0.0024 |
The highlighted rows indicate a statistically significant correlation between income deprivation rate and the respective crime type. The lightest purple indicates significance at the 1% level (violence and sexual offences, anti-social behaviour, possession of weapons), the mid purple indicates significance at the 5% level (vehicle crime, criminal damage and arson, drugs, burglary, robbery), while the darkest purple indicates significance at the 10% level (public order).
So, we find statistically significant correlations between income deprivation rate and most crime types, with the following exceptions: other theft, theft from the person, shoplifting, bicycle theft, and other crime (all with p-values above 0.1).
The ONS deprivation data contains other useful metrics that could factor into crime rates. Moran's I measures the extent to which deprivation is clustered: for instance, areas with high deprivation rates but low deprivation clustering may have lower crime rates than similarly deprived areas with high clustering. The deprivation gap measures the percentage difference between the most and least deprived neighbourhoods in an area and is thus an indication of local inequality.
These three metrics are somewhat interlinked, but distinct enough to avoid significant multicollinearity in a combined model - we can check this manually with VIF.
First, we can explore the relationship with drug crime offences.
## set independent and dep vars
X = df5[['Income deprivation rate (%)', "Moran's I", 'Deprivation gap (%)']]
y = df5['Drugs']
X = sm.add_constant(X) # add a constant: even with all predictors at 0, we would still expect a non-zero crime rate
model = sm.OLS(y, X).fit() # fit model
model.summary() # summarise results
| Dep. Variable: | Drugs | R-squared: | 0.315 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.241 |
| Method: | Least Squares | F-statistic: | 4.282 |
| Date: | Wed, 11 Jan 2023 | Prob (F-statistic): | 0.0131 |
| Time: | 16:03:24 | Log-Likelihood: | -65.450 |
| No. Observations: | 32 | AIC: | 138.9 |
| Df Residuals: | 28 | BIC: | 144.8 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -3.1381 | 2.497 | -1.257 | 0.219 | -8.252 | 1.976 |
| Income deprivation rate (%) | 0.1902 | 0.116 | 1.634 | 0.113 | -0.048 | 0.429 |
| Moran's I | -3.2379 | 2.684 | -1.207 | 0.238 | -8.735 | 2.259 |
| Deprivation gap (%) | 0.2152 | 0.099 | 2.178 | 0.038 | 0.013 | 0.418 |
| Omnibus: | 44.567 | Durbin-Watson: | 1.369 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 195.795 |
| Skew: | 2.919 | Prob(JB): | 3.05e-43 |
| Kurtosis: | 13.619 | Cond. No. | 264. |
The regression results are somewhat surprising, in that both income deprivation rate and Moran's I have p-values above 0.1, and are thus shown to be poor predictors of the drug crime rate.
Conversely, the deprivation gap is statistically significant at the 5% level, with a coefficient of 0.2152. This suggests that for every 1 percentage point rise in the deprivation gap (i.e. the disparity between neighbourhoods and concentration of deprivation), the drug crime rate can be expected to rise by about 0.22 per 1,000 people.
It's worth noting that the adjusted r-squared is relatively low at 0.241, suggesting only around 24% of the variation in drug crime rates is explained by our independent variables.
## check multicollinearity
vif = pd.DataFrame()
vif['VIF factor'] = [variance_inflation_factor(X.values, i) for i in range(X.values.shape[1])]
vif["features"] = X.columns
print(vif.round(1))
   VIF factor                     features
0        49.9                        const
1         1.2  Income deprivation rate (%)
2         1.5                    Moran's I
3         1.6          Deprivation gap (%)
High multicollinearity between independent variables can inflate standard errors and thus distort parameter estimates. All VIF factors (ignoring the constant) are close to 1; we can therefore conclude that the 3 deprivation metrics have little correlation with each other - i.e. sufficiently low that we can accept the inference of the results.
Also, from the scatter plots above, the variance between boroughs seems consistent across deprivation rates, so we would not expect heteroskedasticity to be a problem.
We can apply this same methodology to each crime type, using a for loop to fit the model between each type and the three deprivation measures.
To display the data together intuitively: first a multi-index dataframe is created so the coefficient and p-value of each explanatory variable can be displayed side by side; then all the data is added in the same for loop that fits the models; finally a function conditionally highlights the cells - applying this to individual cells in a multi-index dataframe proved especially tricky, hence the long function.
headers = [
np.array(['Crime Type', 'Income deprivation rate (%)', 'Income deprivation rate (%)', "Moran's I", "Moran's I", 'Deprivation gap (%)', 'Deprivation gap (%)']),
np.array(['', 'coef', 'p-value', 'coef', 'p-value', 'coef', 'p-value']),
] # set headers for multi-index dataframe
regress_df = pd.DataFrame(columns=headers)
X = df5[['Income deprivation rate (%)', "Moran's I", 'Deprivation gap (%)']]
X = sm.add_constant(X) ## add constant to the model (once, outside the loop)
for crime in types:
    y = df5[crime]
    model = sm.OLS(y, X).fit()
    row = [crime, model.params[1].round(4), model.pvalues[1].round(4), model.params[2].round(4),
           model.pvalues[2].round(4), model.params[3].round(4), model.pvalues[3].round(4)]
    regress_df.loc[len(regress_df)] = row
## conditional cell highlighting function
def pvalue_highlight(row):
    s10 = 'background-color: #48446e'
    s5 = 'background-color: #776bcd'
    s1 = 'background-color: #9080ff'
    default = ''
    def style(p):
        ## check the strongest significance level first so weaker thresholds don't overwrite it
        if p < 0.01:
            return s1
        elif p < 0.05:
            return s5
        elif p < 0.1:
            return s10
        return default
    I = style(row['Income deprivation rate (%)']['p-value'])
    M = style(row["Moran's I"]['p-value'])
    D = style(row['Deprivation gap (%)']['p-value'])
    return [default, I, default, M, default, D, default] ## only highlight coef. columns
## for each row apply the function pvalue_highlight.
regress_df.style.apply(pvalue_highlight, axis=1).format('{:.4f}', subset=['Income deprivation rate (%)', "Moran's I", 'Deprivation gap (%)'])
| | Crime Type | Income deprivation rate (%) coef | p-value | Moran's I coef | p-value | Deprivation gap (%) coef | p-value |
|---|---|---|---|---|---|---|---|
| 0 | Violence and sexual offences | 0.8698 | 0.0249 | -4.1379 | 0.6287 | 0.7615 | 0.0211 |
| 1 | Anti-social behaviour | 0.9329 | 0.0864 | -11.1572 | 0.3645 | 1.0995 | 0.0200 |
| 2 | Vehicle crime | 0.2160 | 0.1204 | 4.3685 | 0.1710 | 0.2684 | 0.0264 |
| 3 | Other theft | -0.1163 | 0.8774 | -11.0515 | 0.5265 | 1.4871 | 0.0264 |
| 4 | Criminal damage and arson | 0.1062 | 0.1027 | -0.8733 | 0.5524 | 0.1345 | 0.0179 |
| 5 | Drugs | 0.1902 | 0.1135 | -3.2379 | 0.2377 | 0.2152 | 0.0380 |
| 6 | Public order | 0.1368 | 0.3689 | -3.0731 | 0.3811 | 0.3383 | 0.0128 |
| 7 | Burglary | 0.1204 | 0.1961 | -2.2737 | 0.2873 | 0.2758 | 0.0013 |
| 8 | Shoplifting | -0.1433 | 0.3306 | -0.5314 | 0.8746 | 0.2971 | 0.0224 |
| 9 | Robbery | 0.1478 | 0.2594 | -2.8326 | 0.3468 | 0.2865 | 0.0138 |
| 10 | Theft from the person | 0.0145 | 0.9840 | -15.1994 | 0.3664 | 1.3242 | 0.0385 |
| 11 | Other crime | 0.0292 | 0.4767 | 1.2377 | 0.1952 | -0.0330 | 0.3445 |
| 12 | Bicycle theft | 0.0021 | 0.9843 | -7.0480 | 0.0067 | 0.2389 | 0.0117 |
| 13 | Possession of weapons | 0.0260 | 0.0404 | -0.4068 | 0.1560 | 0.0238 | 0.0279 |
The results are consistent with the initial drug crime model: income deprivation rate is not a good measure on its own, and the strongly significant correlations found previously are likely evidence of omitted variable bias.
Instead, with all three metrics considered, the deprivation gap (%) outperforms income deprivation rate and Moran's I in predicting crime rates. This suggests that, in general, boroughs with the most extreme divides between poor and rich areas can expect to see higher crime rates. This is true irrespective of the more general rate of deprivation, except for violence and sexual offences, possession of weapons, and ASB, which correlate with both the income deprivation rate and the gap.
This could be interpreted in many ways; for instance, those suffering from income deprivation may be more aggrieved by their situation if living near the most affluent - although Moran's I is not a good estimator of this.
These results suggest that the link between inequality and crime is complex - i.e. lower incomes alone cannot be said to increase crime.
## sum the total number of crimes committed by month and crime type
df3a = incidents.groupby(['Month'], as_index=False)['Crime Category'].value_counts()
alt.Chart(df3a).mark_area().encode(
x = alt.X('Month:T', title=None),
y = alt.Y('count:Q', title=None),
color = "Crime Category:N"
).properties(
width = 700,
title = 'London: monthly crime incidents by type'
)
Aside from anti-social behaviour, this stacked area chart shows clear dips during the lockdown periods: first around April 2020, and then again towards the end of 2020 with lockdown 2 (November 2020) and lockdown 3 from the start of January 2021. Breaches of covid restrictions were generally recorded as anti-social behaviour, which explains the sharp rise in ASB. Interestingly, this was less clear in lockdowns 2 and 3, suggesting some change in either behaviour or policing.
However, how this was spread among different crime types is still unclear.
For this, we consider the first lockdown period from April to June 2020, as well as the three-month periods either side of it.
## create filtered dataframes for each period
preL1 = crimes[(crimes['Month'] >= '2020-01') & (crimes['Month'] <= '2020-03')]
L1 = crimes[(crimes['Month'] >= '2020-04') & (crimes['Month'] <= '2020-06')]
postL1 = crimes[(crimes['Month'] >= '2020-07') & (crimes['Month'] <= '2020-09')]
df_preL1 = pd.DataFrame(preL1['Crime Category'].value_counts()).reset_index()
df_preL1['Period'] = 'Jan-Mar'
df_L1 = pd.DataFrame(L1['Crime Category'].value_counts()).reset_index()
df_L1['Period'] = 'Apr-Jun'
df_postL1 = pd.DataFrame(postL1['Crime Category'].value_counts()).reset_index()
df_postL1['Period'] = 'Jul-Sep'
periods_li = [df_preL1, df_L1, df_postL1]
df3a = pd.concat(periods_li, ignore_index=True)
df3a.columns = ['Crime Category', 'Crime Total', 'Period']
alt.Chart(df3a).mark_bar().encode(
x = alt.X('Period:O', title=None, sort=['Jan-Mar', 'Apr-Jun', 'Jul-Sep']),
y = alt.Y('Crime Total:Q'),
color = alt.Color('Period:N'),
column = alt.Column('Crime Category:N', title='London: effects of the first Covid-19 lockdown (2020)'),
tooltip = [alt.Tooltip('Period:N'), alt.Tooltip('Crime Total:Q', format=',')]
)
df3a.head()
| | Crime Category | Crime Total | Period |
|---|---|---|---|
| 0 | Violence and sexual offences | 56160 | Jan-Mar |
| 1 | Vehicle crime | 31759 | Jan-Mar |
| 2 | Other theft | 28134 | Jan-Mar |
| 3 | Burglary | 18510 | Jan-Mar |
| 4 | Theft from the person | 14908 | Jan-Mar |
This shows crimes of most types falling or staying level during the first Covid lockdown, with theft-related crimes falling the most.
Bicycle theft and drug offences are the only categories to show any noticeable increase, although the increase in bicycle theft continues sharply in the 3 months following, suggesting it could be a seasonal effect. The sharp rise (roughly 30%) and then fall of drug offences either side of the lockdown period suggests drug behaviour may have been considerably affected. However, this may not imply a direct relationship (i.e. isolation increasing drug taking), and could result from other effects on policing, such as it becoming easier to track offenders.
Violence and sexual offences remained at a similar level, with a sharp rise following the easing of restrictions. Many of its sub-categories (for which we don't have data), such as minor assault and harassment, could be expected to decrease considerably under lockdown. The lack of any significant change could be due to other crime types, such as domestic offences, increasing during lockdown.
Using a hexbin plot with matplotlib, the individual crime incidents can be binned and then displayed as a choropleth-style map to identify crime hotspots.
fig, ax = plt.subplots(1, 2, figsize=(18,6), sharey=True)
ax[0].hexbin(x=crimes[crimes['Month'] == '2020-04']['Longitude'], y=crimes[crimes['Month'] == '2020-04']['Latitude'])
ax[0].set_title('April 2020 (first full lockdown month)')
ax[1].hexbin(x=crimes[crimes['Month'] == '2022-04']['Longitude'], y=crimes[crimes['Month'] == '2022-04']['Latitude'])
ax[1].set_title('April 2022')
plt.show()
Plotting every crime incident for the first full lockdown month (April 2020) alongside April 2022 shows a marked change in geographic distribution.
The lockdown plot better resembles a standard population density visualisation, with dark areas for London's larger parks, rivers and reservoirs. This suggests that the crimes that did still happen during the lockdown were likely committed closer to the perpetrator's residence.
This could partially support the idea that policing certain offences, such as drug crime, became easier during lockdown, which would help explain the rise (or lack of fall) in incidents.
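The visual shift in concentration can also be given a rough number. Below is a minimal sketch using synthetic coordinates as stand-ins for the Longitude/Latitude columns, with a hypothetical `top_bin_share` helper that bins points on a grid (as hexbin does visually) and measures what share of incidents land in the busiest 10% of bins:

```python
import numpy as np

# Hypothetical helper: bin points on a grid and return the share of points
# falling in the busiest `top_frac` of bins (higher = more concentrated)
def top_bin_share(x, y, gridsize=20, top_frac=0.1):
    counts, _, _ = np.histogram2d(x, y, bins=gridsize)
    flat = np.sort(counts.ravel())[::-1]
    k = max(1, int(top_frac * flat.size))
    return flat[:k].sum() / flat.sum()

# Synthetic stand-ins: a dispersed (lockdown-like) pattern and a clustered one
rng = np.random.default_rng(42)
dispersed = rng.uniform(-1, 1, (2000, 2))
clustered = rng.normal(0, 0.3, (2000, 2))

print(top_bin_share(dispersed[:, 0], dispersed[:, 1]))  # lower share
print(top_bin_share(clustered[:, 0], clustered[:, 1]))  # higher share
```

Applied to the real coordinate columns for April 2020 vs April 2022, a drop in this share would quantify the flattening of hotspots seen in the plots.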
For this, we can look at how crime rates 'recovered' after lockdown and whether this recovery was evenly distributed.
First, we must find the monthly crime totals per borough, merge these with the population data, and calculate crime rates. We can then filter for our chosen Covid period of January 2020 to June 2021.
## calculate monthly crime total per borough
m_count = crimes.groupby(['Borough', 'Month']).size().reset_index(name='count')
## merge with population data and calculate crime rates
m_rate = pd.merge(m_count, pop_df, on=['Borough'])
m_rate['Crime Rate'] = ((1000*m_rate['count']) / m_rate['population']).round(2)
## filter df to remove City of London crimes
df_f = m_rate[(m_rate['Borough'] != 'City of London')]
## filter dataframe for covid period: January 2020 -> June 2021
covid = df_f[(df_f['Month'] >= '2020-01') & (df_f['Month'] <= '2021-06')].reset_index(drop=True)
To evaluate whether pandemic effects were equally distributed, we can give each borough an equal starting point by indexing: the first month's value becomes the base from which all later values are measured.
## sort deprivation data and remove City of London
f1 = dep_df.sort_values(by=['Income deprivation rate (%)'], ascending=False, ignore_index=True)
f2 = f1[(f1['Borough'] != 'City of London')]
## creates a list of most and least deprived boroughs
high_dep = list(f2.head(8)['Borough'])
low_dep = list(f2.tail(8)['Borough'])
## create list of unique boroughs to iterate over
boroughs = list(covid['Borough'].unique())
## calculate index for crime rates = 100 at the first month
index = []
for b in boroughs:
    subset = covid[(covid['Borough'] == b)].reset_index(drop=True)
    base = subset['Crime Rate'].iloc[0] ## first row, ie the January 2020 value
    subset['Rate_indexed'] = (subset['Crime Rate'] / base) * 100
    index.extend(list(subset['Rate_indexed']))
covid['Index'] = index
# merge with deprivation data
covid_df = pd.merge(covid, dep_df, on='Borough')
# filter for most and least deprived
tails_covid_df = covid_df[(covid_df['Borough'].isin(low_dep)) | (covid_df['Borough'].isin(high_dep))]
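As a side note, the per-borough indexing loop can also be expressed as a single groupby transform. A minimal sketch on toy data (hypothetical borough names and rates, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the covid dataframe: two boroughs, three months each
df = pd.DataFrame({
    'Borough': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Crime Rate': [10.0, 8.0, 12.0, 20.0, 15.0, 25.0],
})
# Index each borough's rates to 100 at its first month in one pass
df['Index'] = df.groupby('Borough')['Crime Rate'].transform(lambda s: 100 * s / s.iloc[0])
print(df['Index'].tolist())  # [100.0, 80.0, 120.0, 100.0, 75.0, 125.0]
```

This avoids building the intermediate list and relies on pandas keeping row order within groups.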
alt.Chart(tails_covid_df).mark_line(point='transparent').encode(
x = alt.X('Month:O', title=None, axis=alt.Axis(labelAngle=-30, labelOffset=12)),
y = alt.Y('Index:Q', title=None, scale=alt.Scale(domain=(20, 140))),
detail = alt.Detail('Borough:N'),
color = alt.Color('Income deprivation rate (%):Q', scale=alt.Scale(scheme='blueOrange')),
tooltip = [alt.Tooltip('Borough:N'), alt.Tooltip('Month:N'), alt.Tooltip('Crime Rate:Q', format='.3', title='Crime Rate (per 1,000 people)'), alt.Tooltip('Income deprivation rate (%):Q'), alt.Tooltip('Index:Q', format='.3')]
).properties(
title = 'Average monthly crime rate during the pandemic',
width = 600,
height = 320
)
This graph shows the development of crime rates for a subset of London boroughs (8 most + 8 least income deprived). The initial Covid / lockdown shock is clear between March and April 2020, with crime rates falling by between 15-45%. Crime rates bottom out again in January 2021, the first month of the 3rd lockdown.
By incorporating a diverging colour scale amongst the boroughs, it's clear that the effect on crime rates is unequally distributed. Boroughs with higher income deprivation (indicated by the orange lines) typically saw a smaller reduction in crime rates during lockdowns. This continued in the 'return-to-normal' periods, with high-deprivation areas returning to near or above pre-pandemic levels within 2-3 months of the 1st and 3rd lockdowns, while low-deprivation areas settled at crime rates roughly 10% (0-20%) below pre-pandemic levels.
To consider any differences in impact on trends, we can compare the 6-month period leading up to the first lockdown with the 6-month period following the lifting of all Covid restrictions.
## filter new dataframes for before and after covid period
pre_covid = df_f[(df_f['Month'] >= '2019-10') & (df_f['Month'] <= '2020-03')].reset_index(drop=True)
post_covid = df_f[(df_f['Month'] >= '2021-07') & (df_f['Month'] <= '2021-12')].reset_index(drop=True)
# calculates the monthly mean crime rate across the 6 month period, and merges with deprivation data
pre_df = pd.merge(pre_covid.groupby('Borough', as_index=False)['Crime Rate'].mean(), dep_df, on='Borough')
post_df = pd.merge(post_covid.groupby('Borough', as_index=False)['Crime Rate'].mean(), dep_df, on='Borough')
We can use the lists of most and least deprived boroughs to filter the datasets:
low_pre_df = pre_df[(pre_df['Borough'].isin(low_dep))]
high_pre_df = pre_df[(pre_df['Borough'].isin(high_dep))]
low_post_df = post_df[(post_df['Borough'].isin(low_dep))]
high_post_df = post_df[(post_df['Borough'].isin(high_dep))]
Lastly, calculate the relative change in mean crime rate between the two periods for the most and least deprived boroughs.
low_diff = (low_post_df['Crime Rate'].mean() - low_pre_df['Crime Rate'].mean())/low_pre_df['Crime Rate'].mean()
high_diff = (high_post_df['Crime Rate'].mean() - high_pre_df['Crime Rate'].mean())/high_pre_df['Crime Rate'].mean()
print(f'Change in crime rate amongst most deprived boroughs: {high_diff: .2%}')
print(f'Change in crime rate amongst least deprived boroughs: {low_diff: .2%}')
Change in crime rate amongst most deprived boroughs: -0.71%
Change in crime rate amongst least deprived boroughs: -6.86%
This shows the difference in average monthly crime rate between the 6-month periods before and after the main pandemic period (Apr 2020 - Jun 2021).
The most income-deprived boroughs saw a negligible change in crime rates, while the least income-deprived saw an almost 7% reduction. This decrease could reflect a long-term trend rather than being a direct result of the pandemic. However, the failure of the most deprived boroughs to match this trend strongly suggests an unequal distribution of the pandemic's detrimental effects (and its associated impacts on wellbeing, livelihoods etc.).
Whatever factors underlie this difference, structurally different areas will always vary in their sensitivity to exogenous shocks. This could also illustrate how equal treatment (e.g. through income support schemes) can in practice be regressive.
pre = alt.Chart(pre_df).mark_circle().encode(
x = alt.X('Income deprivation rate (%):Q', scale=alt.Scale(domain=[4, 24])),
y = alt.Y('Crime Rate:Q', title=None, scale=alt.Scale(domain=[0, 35])),
detail = 'Borough:N',
color = alt.Color('Income deprivation rate quintile:N'),
tooltip = [alt.Tooltip('Borough:N'), alt.Tooltip('Crime Rate:Q', format='.3', title='Crime Rate (per 1,000 people)'), alt.Tooltip('Income deprivation rate (%):Q')]
).properties(
title = 'Average crime rate pre-Covid'
)
post = alt.Chart(post_df).mark_circle().encode(
x = alt.X('Income deprivation rate (%):Q', scale=alt.Scale(domain=[4, 24])),
y = alt.Y('Crime Rate:Q', title=None, scale=alt.Scale(domain=[0, 35])),
detail = 'Borough:N',
color = alt.Color('Income deprivation rate quintile:N'),
tooltip = [alt.Tooltip('Borough:N'), alt.Tooltip('Crime Rate:Q', format='.3', title='Crime Rate (per 1,000 people)'), alt.Tooltip('Income deprivation rate (%):Q')]
).properties(
title = 'Average crime rate post-Covid'
)
pre | post
Flaws in analysis:
To better compare boroughs, the analysis adjusts for population differences using 2021 Census population estimates. A single estimate is applied across the whole time period (imperfect, as population is dynamic), and it does not account for the larger daytime populations of the inner London boroughs.
Finally, we found that lockdowns have disproportionately affected different London boroughs, with the richest and poorest boroughs seeing clear divergence in crime rates.
Further ideas:
Another interesting dataset covers police stop and search. This is again recorded with (rough) coordinate locations, but also includes demographic features. A potential idea could be looking at the clustering of crimes and seeing whether stop and search corresponds to these metrics: i.e. how targeted is stop and search?
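Gathering this data would follow the same pattern as the crime data: the API exposes a stops-street endpoint that mirrors crimes-street (a lat/lng pair with a 1-mile radius, plus a YYYY-MM date). A minimal sketch, with illustrative central-London coordinates:

```python
import requests

def stops_url(lat, lng, date):
    """Build a stops-street query URL for one location and month (YYYY-MM)."""
    return f'https://data.police.uk/api/stops-street?lat={lat}&lng={lng}&date={date}'

def get_stops(lat, lng, date):
    """Fetch stop-and-search records; each includes location and demographic fields."""
    resp = requests.get(stops_url(lat, lng, date))
    resp.raise_for_status()
    return resp.json()

if __name__ == '__main__':
    # Illustrative coordinates (Trafalgar Square area), not from the analysis above
    print(stops_url(51.5074, -0.1278, '2022-04'))
```

As with the crime calls, results could be gathered per borough polygon and month, then compared against the crime hotspot bins.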